Overview

The age of D3D12 & Vulkan has begun! 
Caveat emptor

• D3D11 drivers are really well optimized 

• Use your knowledge to outsmart & outperform 
the D3D11 driver 

• D3D12 was not invented to write a legacy API 
driver on top 

• Other issues 



[Figure labels: D3D12 booster, Your engine, Vulkan booster]
Stage 0

[Diagram, top to bottom: Game → "The Renderer" → IRenderSystemD3D9-LookAlike → OpenGL / D3D11 / D3D9, OpenGL ES, etc.]
Stage 0.5

[Diagram, top to bottom: Game → "The Renderer" → IRenderSystemD3D11-LookAlike → OpenGL / D3D11]
Stage 1

[Diagram, top to bottom: Game → "The Renderer" → IRenderSystemD3D11-LookAlike → D3D12 & Vulkan]















GDC GAME DEVELOPERS CONFERENCE March 14-18, 2016 • Expo: March 16-18, 2016 #GDC16


Stage 2

[Diagram, top to bottom: Game → "The Renderer" → IRenderSystemD3D12-LookAlike → D3D11 / D3D12 & Vulkan. Annotations on the interface: "Not high level enough", "Not low level enough"]
[Diagram labels: OpenGL, D3D11, D3D12, Vulkan]

State of the nation 


• Engines are transitioning to support Vulkan and D3D12 

• D3D11 support still required 

• Most are midway between Stage 1 and 2 

• Lots of thought needed to get the best out of all APIs 

• Multi-queue support requires additional work 

• Needs to scale down to D3D11 

• Targeting D3D12/Vulkan and running on D3D11 is the 
recommended way 
Design for the future

• I'll point out common design issues

• Get your engine ready

• Turn your knowledge into better performance

Design first!
Barrier control

• Barriers are a new concept in D3D12/Vulkan 

• Sad truth: Everyone gets them wrong 

• Two failure cases: 

• Too many or too broad: Bad performance 

• Missing barriers: Corruptions 

• D3D11 driver does this under the hood - and quite well 
What's a barrier, anyway?

[Diagram: Z-Buffer → Transition → Ambient occlusion]

Render target to texture

• Probably a decompression is needed (& cache flush)

• What will happen changes between vendors and GPU generations - can be a no-op, can be a wait for idle, can be a full cache flush
What's a barrier, anyway?

[Diagram: UAV → Transition → Execute indirect source]

UAV to resource

• If done badly, it will cost - flush or wait for idle

• If done correctly, those transitions can be free
Missing barriers

• Format problems - GPU/driver specific corruption

• Synchronization problems - time-dependent corruption

Subresources

• Need to be tracked individually

• Downsampling

• Shadow map atlas

• If you transition all subresources, use D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES instead of going one-by-one
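
A minimal sketch of the subresource rule above, written against stand-in types so it compiles without d3d12.h. `State`, `Barrier`, `TrackedTexture` and `TransitionAll` are hypothetical names for illustration; only the idea (one all-subresources barrier when every subresource agrees, individual barriers otherwise) comes from the slide.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins: real code would use D3D12_RESOURCE_STATES and
// D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES from d3d12.h.
using State = std::uint32_t;
constexpr std::uint32_t kAllSubresources = 0xffffffffu;

struct Barrier {
    std::uint32_t subresource;  // subresource index, or kAllSubresources
    State before;
    State after;
};

// One tracked state per subresource (mip level or array slice).
struct TrackedTexture {
    std::vector<State> states;
};

// Returns the barriers needed to bring every subresource into `target`
// and updates the tracked states.
std::vector<Barrier> TransitionAll(TrackedTexture& tex, State target) {
    assert(!tex.states.empty());
    std::vector<Barrier> out;

    bool uniform = true;
    for (State s : tex.states) uniform = uniform && (s == tex.states[0]);

    if (uniform) {
        // Every subresource agrees: a single barrier covers the resource.
        if (tex.states[0] != target)
            out.push_back({kAllSubresources, tex.states[0], target});
    } else {
        // Mixed states (e.g. mid-downsample): go one by one, skipping no-ops.
        for (std::uint32_t i = 0; i < tex.states.size(); ++i)
            if (tex.states[i] != target)
                out.push_back({i, tex.states[i], target});
    }

    for (State& s : tex.states) s = target;
    return out;
}
```

For a mip chain that is fully in one state this yields exactly one barrier; during downsampling, where mips differ, it falls back to per-subresource barriers.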
Placed resources & initial states

• Render targets created as placed resources etc. must be cleared before use

• Go into clear state directly, don't start with some random state and transition
Unnecessary transitions

• Transitioning to wrong type

• Not common but still occasionally happens

• Make sure to check with validation layer

• Read-read transitions

• Moving between two read states, e.g. from index buffer to shader resource

• Moving to union of all future states requires only one barrier
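
The read-read case can be sketched as follows. The flag values and the helpers `CombineReadStates` and `NeedsBarrier` are assumptions made for illustration, not D3D12 definitions; the point is only that read states are bit flags, so one transition into their union serves every later read.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical bit flags standing in for D3D12 resource states.
using State = std::uint32_t;
constexpr State kIndexBuffer        = 1u << 0;
constexpr State kPixelShaderRead    = 1u << 1;
constexpr State kNonPixelShaderRead = 1u << 2;
constexpr State kCopySource         = 1u << 3;
constexpr State kUnorderedAccess    = 1u << 4;  // a write state
constexpr State kReadMask =
    kIndexBuffer | kPixelShaderRead | kNonPixelShaderRead | kCopySource;

// Builds the union of all upcoming read usages of a resource.
State CombineReadStates(std::initializer_list<State> upcomingUses) {
    State combined = 0;
    for (State use : upcomingUses) {
        assert((use & ~kReadMask) == 0);  // write states must not be mixed in
        combined |= use;
    }
    return combined;
}

// A barrier is needed only if the current state lacks a required bit.
bool NeedsBarrier(State current, State required) {
    return (current & required) != required;
}
```

After a single transition into `CombineReadStates({kIndexBuffer, kPixelShaderRead})`, neither the index-buffer use nor the shader read needs another barrier.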
Costly transitions

• COMMON is for copies/present, not a general "catch all" state

• Usually you want shader access

• In D3D12: D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE | D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE

• In Vulkan: VK_ACCESS_SHADER_READ_BIT
Barrier control - Worst case #1

• Worst-case barrier system - too many barriers

• Material system going wrong

• For maximum damage, do it per stage

[Diagram: resources bound per shader stage. Shader stage #1: Buffer 0, Texture 0. Shader stage #2: Texture 0, Texture 1. Further labels: Texture 1, Texture 2]
Barrier control - Worst case #1

• "Late binding", or fixing up resources per draw

• for (auto& stage : stages) {
      for (auto& resource : resources) {
          // Anti-pattern: one ResourceBarrier call per stage per resource
          if ((resource.state & STATE_READ) == 0) {
              auto barrier = resource.Barrier(STATE_READ);
              ResourceBarrier(1, &barrier);
          }
      }
  }

• Let's take a look at what happens here!
Barrier control - Worst case #1

• Ideal flow 


Write access Draw Draw Draw Draw Draw 


• Per material/stage anti-pattern 

• One barrier per stage per resource 

• Barriers scattered all over the command list 


Write access Draw Draw Draw Draw Draw 


• In the worst case, multiple wait-for-idle back-to-back 
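
For contrast with the per-stage anti-pattern, here is a sketch of the ideal flow: gather every pending transition first, then issue them in one call before the draws. `Resource`, `Barrier`, `CommandList` and `TransitionForReading` are hypothetical stand-ins, and the command list only counts calls, so this is a model of the pattern rather than real D3D12 code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using State = std::uint32_t;
constexpr State STATE_READ  = 1u << 0;  // hypothetical flags
constexpr State STATE_WRITE = 1u << 1;

struct Resource { State state; };
struct Barrier { Resource* resource; State before; State after; };

// Stand-in for a command list; it only counts what gets submitted.
struct CommandList {
    int barrierCalls = 0;
    int barriersSubmitted = 0;
    void ResourceBarrier(std::uint32_t count, const Barrier* barriers) {
        assert(count > 0 && barriers != nullptr);
        ++barrierCalls;
        barriersSubmitted += static_cast<int>(count);
    }
};

// Collects all write-to-read transitions and submits them as one batch.
// A resource listed twice (bound by two stages) is transitioned once.
void TransitionForReading(CommandList& cl,
                          const std::vector<Resource*>& resources) {
    std::vector<Barrier> batch;
    for (Resource* r : resources) {
        if ((r->state & STATE_READ) == 0) {
            batch.push_back({r, r->state, STATE_READ});
            r->state = STATE_READ;
        }
    }
    if (!batch.empty())
        cl.ResourceBarrier(static_cast<std::uint32_t>(batch.size()),
                           batch.data());
}
```

Calling this once per pass, ahead of the draws, keeps the barriers in one place in the command list instead of scattering them between draws.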
Barrier control - Worst case #2

• "Base state" or redundant transitioning

• Transition to target state followed by restore

[Diagram: a resource alternates Base state → Target state → Base state. States shown: Render target, Shader resource, Copy source, Copy target. One target state is marked "Not actually used - just transitioned back"]


Funny barriers

• ResourceBarrier(0, nullptr)

• Nothing changed, thank you!

• Indicates your state tracking is doing the wrong thing

• Previous state equal to next state

• Happens more than you believe - just say no

• Always remember - driver assumes you're doing the optimal thing, doesn't go through any heuristic itself!
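
A small debug-time guard against both "funny" cases might look like the sketch below. `SubmitFiltered`, `Barrier` and `CommandList` are hypothetical; the returned count is meant to be logged or asserted on, since a non-zero value points at a state tracking bug rather than something to hide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using State = std::uint32_t;
struct Barrier { int resourceId; State before; State after; };

// Stand-in for a command list; it only counts barrier calls.
struct CommandList {
    int barrierCalls = 0;
    void ResourceBarrier(std::size_t count, const Barrier* barriers) {
        // A (0, nullptr) call means the tracker produced nothing useful.
        assert(count > 0 && barriers != nullptr);
        ++barrierCalls;
    }
};

// Drops barriers whose previous state equals the next state, submits the
// rest in one call (or nothing at all), and returns how many were dropped.
std::size_t SubmitFiltered(CommandList& cl, std::vector<Barrier> barriers) {
    const std::size_t total = barriers.size();
    barriers.erase(
        std::remove_if(barriers.begin(), barriers.end(),
                       [](const Barrier& b) { return b.before == b.after; }),
        barriers.end());
    if (!barriers.empty())
        cl.ResourceBarrier(barriers.size(), barriers.data());
    return total - barriers.size();
}
```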
Get ready for the future

• You should not have to track all resource state

• 99% of your resources are immutable - read-only. Trust me :)

• Find "transition" points - when do passes end?

• Batch barriers here

• Only transition what you need

Design first!

[Diagram: G-Buffer → Shadow maps → Shading → Post, with "Batch transitions" at each of the three pass boundaries]
Barrier debugging tips

• Have a write/read bit 

• Log all transitions 

• Grep & spreadsheets are your friends 

• Check for # transitions, transition type, etc. 

• Number of transitions should be in the order of number 
of writable resources 

• Again, log and grep are your friends 

• If it's over 9000, something is fishy! 
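
One way to make transitions greppable is a tiny log like the sketch below. `TransitionLog`, its line format and the `factor` threshold are assumptions made for illustration; the slide only says that the transition count should be in the order of the number of writable resources.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Collects one text line per transition plus simple counters.
struct TransitionLog {
    std::ostringstream text;                         // grep-friendly output
    std::map<std::string, std::size_t> perResource;  // transitions per name
    std::size_t total;

    TransitionLog() : total(0) {}

    void Record(const std::string& name, unsigned before, unsigned after) {
        assert(!name.empty());
        text << "BARRIER " << name << " " << before << " -> " << after << "\n";
        ++perResource[name];
        ++total;
    }

    // Heuristic: flag a frame whose transition count is far above the
    // number of writable resources. `factor` is an arbitrary assumption.
    bool LooksSuspicious(std::size_t writableResources,
                         std::size_t factor = 4) const {
        return total > writableResources * factor;
    }
};
```

Dumping `text.str()` once per frame gives a file that can be grepped by resource name or pasted into a spreadsheet.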


Barrier debugging tips 

• Have a barrier-everything mode 

• Same as the "worst-case" mode described previously 

• For debugging only 

• Ensure your resources are in a known state at 
least once per frame 

• For example, at frame end/start 

• Transition everything into a known state - that resolves 
problems like TAA or shadow atlas breakage 
Going forward

• Even better, eventually

[Timeline figures: "Write access" followed by a sequence of draws]

• Give driver time to handle the transition

• "Split barrier" in D3D12

• vkCmdSetEvent + vkCmdWaitEvents
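
The bookkeeping behind a split barrier can be modeled as below. In D3D12 the two halves correspond to the `D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY` and `D3D12_RESOURCE_BARRIER_FLAG_END_ONLY` flags; the class `SplitTransition` and its state flags are hypothetical and only track the rule that the resource must not be used between the two halves.

```cpp
#include <cassert>
#include <cstdint>

using State = std::uint32_t;
constexpr State kRenderTarget = 1u << 0;  // hypothetical flags
constexpr State kShaderRead   = 1u << 1;

// Tracks one resource whose transition is split into a begin and an end.
class SplitTransition {
public:
    explicit SplitTransition(State initial)
        : current_(initial), pending_(initial), inFlight_(false) {}

    // First half: issue right after the last write.
    void Begin(State after) {
        assert(!inFlight_);
        pending_ = after;
        inFlight_ = true;
    }

    // Second half: issue right before the first use in the new state.
    void End() {
        assert(inFlight_);
        current_ = pending_;
        inFlight_ = false;
    }

    // While a transition is in flight the resource is usable in no state.
    bool UsableAs(State s) const {
        return !inFlight_ && (current_ & s) == s;
    }

private:
    State current_;
    State pending_;
    bool inFlight_;
};
```

The more unrelated work is recorded between `Begin` and `End`, the more time the driver has to complete the transition in the background.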
Summary: Barriers

• Make sure to transition all the resources 
that need it (but not more) 

• Go into the most specific state you can 

• Remember you can combine various 
states 
Launch control

• How to feed the GPU 

• Submitting command lists, first and foremost 

• Per-frame resource updates & tracking second 



CPU threading

[Figure: CPU cores]
CPU threading

• Don't limit parallelism by assigning cores manually

• Use a task/job system

• Uses all cores automatically

• Requires extra care for efficient work submission and resource synchronization


[Screenshot: GPUView capture on an AMD Radeon (TM) R9 Fury Series adapter, showing the hardware queues (including two copy queues), the device context and the paging queue]
What happened?

• Thread pool gone wild :)

• CPU tasks submitted work at the end

• Task boundary became CPU/GPU sync point

• Take control over the command lists after the tasks have finished

[Screenshot: GPUView capture]
What happened?

• Each fence is basically a wait-for-idle on the GPU (more or less) 

• Better: 

• Protect per-frame resources 

• Unlikely you can start working on a command list "mid-frame" anyway 

• Protect many resources with a single fence 

• Make sure your job system can do this 

• Batch up submissions as much as possible 

• Submit early to keep the GPU busy at all times 
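
A sketch of protecting all per-frame resources with a single fence value per frame. `FrameRing`, `FakeGpu` and `MustWait` are hypothetical; a real implementation would signal an `ID3D12Fence` or a `VkFence` at the point where `EndFrame` is called here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the GPU side of a fence: the last value it has reached.
struct FakeGpu { std::uint64_t completedFence; };

// Ring of frame slots. Each slot remembers the one fence value that guards
// every per-frame resource (allocators, upload buffers, ...) in that slot.
class FrameRing {
public:
    explicit FrameRing(std::size_t framesInFlight)
        : slotFence_(framesInFlight, 0), frame_(0), nextValue_(0) {
        assert(framesInFlight > 0);
    }

    // Fence value that must be reached before this frame's slot may be
    // reused; 0 means the slot has never been submitted.
    std::uint64_t BeginFrame() const {
        return slotFence_[frame_ % slotFence_.size()];
    }

    // Call once after all command lists of the frame were submitted.
    // Returns the single fence value to signal for the whole frame.
    std::uint64_t EndFrame() {
        slotFence_[frame_ % slotFence_.size()] = ++nextValue_;
        ++frame_;
        return nextValue_;
    }

private:
    std::vector<std::uint64_t> slotFence_;
    std::uint64_t frame_;
    std::uint64_t nextValue_;
};

// The CPU only has to block if the GPU has not reached the slot's value.
bool MustWait(const FakeGpu& gpu, std::uint64_t required) {
    return gpu.completedFence < required;
}
```

With this layout the job system never creates a fence per task; the number of fence signals per frame stays at one regardless of how many command lists were recorded.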



Ideal submission

[Timeline figure: command lists from "Start build" to "Submit". CL = Command list]

Command allocators 


• Command allocators are defined to be "grow only"

• Recording 100 draw calls on a fresh allocator will allocate memory

• Resetting and recording the same draw calls again will not allocate memory again

• Try to reuse command allocators for similar workloads

• Recycling allocators will grow them to the worst-case size

• In total, number of allocators should be roughly # threads × # frames buffered × # GPUs

• We've seen 20,000 allocators being allocated - lots of memory waste

• Make sure to reuse allocators/command lists and don't recreate per frame
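
The sizing rule and the grow-only behavior can be modeled like this. `FakeAllocator`, `AllocatorCount` and `AllocatorIndex` are hypothetical helpers, and the byte sizes are made up; the sketch only mirrors the behavior described in the bullets above.

```cpp
#include <cassert>
#include <cstddef>

// Model of a grow-only allocator: Reset keeps the memory already reserved.
struct FakeAllocator {
    std::size_t capacity;
    std::size_t used;

    FakeAllocator() : capacity(0), used(0) {}

    void Record(std::size_t bytes) {
        used += bytes;
        if (used > capacity) capacity = used;  // grows, never shrinks
    }
    void Reset() { used = 0; }
};

// Pool size suggested on the slide: threads x frames buffered x GPUs.
std::size_t AllocatorCount(std::size_t threads, std::size_t framesBuffered,
                           std::size_t gpus) {
    return threads * framesBuffered * gpus;
}

// Fixed slot for (thread, frame, gpu), so each thread sees the same
// allocator again every `framesBuffered` frames and reuses its memory.
std::size_t AllocatorIndex(std::size_t thread, std::size_t frame,
                           std::size_t gpu, std::size_t threads,
                           std::size_t framesBuffered) {
    assert(thread < threads && framesBuffered > 0);
    return (gpu * framesBuffered + frame % framesBuffered) * threads + thread;
}
```

For 8 worker threads, 3 buffered frames and one GPU this gives 24 allocators, created once at startup and indexed per frame.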



Designing for Multithreading 
Also: Renderpasses

• Build a high-level graph of your frame

• Tell the Renderer about it via Vulkan's render-passes and subpasses

• Allows the driver to pick an optimal schedule
Also: Renderpasses

• Allows you to express "don't care" nicely

• Much more about this can be found in the "Vulkan Fast Paths" talk




Debugging hints 

• Have an option to submit all command lists in one 
submission 

• Helps with timing issues 

• If not possible, you have in-frame GPU/CPU synchronization

• Have an option to wait for any command list 

• Helps with upload/resource synchronization 

• Some resource gets corrupted? Flush the GPU before updating it 
Summary: Submission

• Track resources at a per-frame 
granularity 

• Know your frame structure 

• Threading is essential to get good CPU 
usage 
[Screenshot: GPUView hardware queues on AMD Radeon (TM) R9 Fury Series: 3D, Copy, Copy, Compute]

[Screenshot: GPUView hardware queues on NVIDIA GeForce GTX 980 Ti: 3D, Copy, Copy, Compute]


Multi-Queue 

• D3D12 and Vulkan expose multiple queue types: Copy, 
graphics, compute 

• On Vulkan, check the queue capabilities and how many are present 

• On D3D12, one of every kind is guaranteed to be available - but no 
scheduling guarantees are given 

• Compute queue is getting a lot of good use 

• Copy queue is not used much - could use more love 
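
On Vulkan, checking queue capabilities usually means scanning the queue families for dedicated compute and transfer families. The sketch below does this over stand-in data: `QueueFamily` and the `k...` flags only mirror the shape of `VkQueueFamilyProperties` and `VkQueueFlagBits`, and `FindFamily` and `SelectQueues` are hypothetical helpers. Real code would fill the list via `vkGetPhysicalDeviceQueueFamilyProperties`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for the Vulkan queue capability bits and family description.
constexpr std::uint32_t kGraphics = 1u << 0;
constexpr std::uint32_t kCompute  = 1u << 1;
constexpr std::uint32_t kTransfer = 1u << 2;

struct QueueFamily { std::uint32_t flags; std::uint32_t queueCount; };

constexpr int kNotFound = -1;

// First family with a queue, all `required` bits and no `excluded` bit.
int FindFamily(const std::vector<QueueFamily>& families,
               std::uint32_t required, std::uint32_t excluded) {
    for (std::size_t i = 0; i < families.size(); ++i) {
        const QueueFamily& f = families[i];
        if (f.queueCount > 0 && (f.flags & required) == required &&
            (f.flags & excluded) == 0)
            return static_cast<int>(i);
    }
    return kNotFound;
}

struct QueueSelection { int graphics; int compute; int copy; };

// Prefers dedicated compute and copy families, falls back to shared ones.
QueueSelection SelectQueues(const std::vector<QueueFamily>& families) {
    assert(!families.empty());
    QueueSelection s;
    s.graphics = FindFamily(families, kGraphics, 0);
    s.compute = FindFamily(families, kCompute, kGraphics);
    if (s.compute == kNotFound) s.compute = FindFamily(families, kCompute, 0);
    s.copy = FindFamily(families, kTransfer, kGraphics | kCompute);
    if (s.copy == kNotFound) s.copy = FindFamily(families, kTransfer, 0);
    return s;
}
```

When the fallback hands back the graphics family for compute or copy, the engine should treat the work as running on the same queue and skip the cross-queue synchronization.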
Graphics and Compute

• We see great results from async compute so far

• Run compute load while graphics queue is idle

• We typically see one compute command list running in parallel, with one fence for sync

• That's fine

• The more compute the better :)
Async compute

• Pit of success

[Diagram: frame timeline. Graphics passes: G-Buffer + Z-Buffer, Shadow maps, Shading, Post-Processing. Overlapped work: SSAO, light tile classification]

Different bottlenecks - maximized GPU usage with async
Async compute

• Pit of no success

[Diagram: frame timeline. Graphics passes: G-Buffer + Z-Buffer, Shadow maps, Shading, Post-Processing. Overlapped work: SSAO, light tile classification]

Resource competition - can be worse than running sequentially
Async compute

• Pit of even more success

[Diagram: frame timeline. Graphics passes: G-Buffer + Z-Buffer, Shadow maps, Shading, Post-Processing. Overlapped work: SSAO, light tile classification. Annotation: "Actual frame end - frames overlap"]

Design first!







Copy to the rescue? 

• Copy queue is low-latency, low-speed, but it's separate hardware 

• Copy queue is optimized for transfer over PCIe, not for GPU local copies 

• For PCIe, it is the fastest way to transfer data 

• Avoid waiting on copy queue from graphics/compute 

• Ideal use of copy queue is streaming data over a few frames 

• Haven’t seen much use so far 

• Talk to us about why

• For copying between adapters, copy queue is also best - consider shared 
swapchain though 
Summary: Multi-queue

• Use the compute queue to fill up the GPU 

• Use copy queue to saturate PCIe 

• Know your frame structure to find the 
best location to schedule async work 




Resources 


• On average, things work just fine 

• Uploads rarely a problem, but remember to look at the copy queue 

• On-GPU management mostly ok 

• Packing sometimes not as tight as it could be, check alignment! 

• For "high-frequency" resources like frame buffers, prefer 
CreateCommittedResource in D3D12 

• Lots of issues with residency and budget 

• Time travel back to yesterday and watch Dave Oldcorn's & Stephan Hodes' talk "Right on 
Queue - Advanced DirectX12 programming" [If time travel is not invented until the talk 
replace with presentation URL] 

• It's an ugly topic - too much to cover here. Talk to me afterwards! 
Debug runtime & Validation layers

• D3D12 and Vulkan have validation layers 

• The driver does not validate for performance reasons 

• We assume your application is perfect 

• During development, make sure to pass validation warning/error free 

• If your app doesn't support validation, add support for that now! 

• Any undefined behavior will bite you, especially with Vulkan - much wider hardware 
variety 

• Please don't play spec lawyer yourself - if something is unclear or in doubt, contact your IHV partner to clarify

• Spec and validation layers are constantly evolving 

• Various corner cases haven't been fully understood yet 



Mysteries that need more R&D 

• ExecuteIndirect

• Haven't seen serious problems with this yet 

• Mostly used for draw auto and dispatch indirect - we expect more crazy use down the line 

• See "Optimizing the Graphics Pipeline With Compute" on Friday 

• Bundles 

• Not enough game experience yet 

• Unclear how to get performance out of it - we're still gathering data 

• mGPU 

• Not enough game experience yet but in general seems to be "easy" enough 

• Copies through system memory should go on copy queue 

• Shared swapchain is good - but needs Windows 10 1511 
Closing remarks

• Vulkan and D3D12 deliver on their promises 

• Require additional thought 

• Just trying to reimplement D3D11 does not provide a benefit! 

• Engines require re-thinking to take advantage of the explicit APIs going 
forward 

• Many driver issues are now app issues 

• Synchronization (barriers!) 

• Memory management (uploads, residency) 

• This means you have the power to fix most issues! 
Who's awesome? You're Awesome!



@jasperbekkers 



@baldurk 



@gwihlidal 



@maverikou 



@martinjifuller 


@repi 


Dean Sekulic 


Markus Rogowsky 



@dankbaker 


Raymund Fulop 



Thanks to Kerbal Space Program for letting me use screenshots! Go Jebediah!







